This is a demonstration of visualization of the method used in Topological stratification of continuous genetic variation in large biobanks by Diaz-Papkovich et al. We use data from the Thousand Genomes Project (1KGP) to generate static images via ggplot2 and interactive plots using plotly. This script does not generate UMAP or HDBSCAN data—these are created in the accompanying Python scripts.
The first portion of the script creates a static plot; the second creates an interactive plot.
This is a pre-amble that sets up libraries, colours, directories, helper functions, etc. The data used here is stored in the data subfolder.
The directories below are set up to use pre-generated data. You may need to change them if you use the driver scripts (call_umap_script.sh and call_hdbscan_script.py) without changing the directories specified there.
We generate a static plot in this code chunk. The basic idea is simple: first plot the points, then colour them in based on their cluster label (from HDBSCAN) or sampling population label (from the 1KGP itself). There are some extra steps involved in colouring and label placement: we calculate the mean position of all points in a cluster to position its label; in some cases, labels need to appear multiple times (i.e. split clusters), so we have to manually define their positions—otherwise, the label can appear in the middle of nowhere.
# Image parameters (dot size, alpha)
s <- 0.1
a <- 0.4
# Clustering file to use
f <- "hdbscan_labels_min25_EPS0.5_1000G_UMAP_PC16_NC5_NN50_MD0.01_euclidean_2019814225811"
clusters <- import_clusters(cluster_dir, f)
main_data <- data.frame(cbind(ids, coords))
colnames(main_data)[1] <- "ID"
main_data$ID <- as.character(main_data$ID)
main_data <- main_data[,-c(2)]
# add population labels
main_data <- merge(main_data, samples, by.x = "ID", by.y = "sample")
# add cluster labels
main_data$cluster <- clusters$cluster
# Calculate the mean of the cluster labels (to place the labels)
pop_means <- get_cluster_means(main_data, "x1", "x2", "pop")
# Note that in this 2D visualization, the ITU and GIH populations are split into multiple clusters
pop_means <- subset(pop_means, !(pop %in% c("ITU","GIH")))
pop_means <- rbind.data.frame(pop_means, list("ITU",-7,5))
pop_means <- rbind.data.frame(pop_means, list("ITU",15,16))
pop_means <- rbind.data.frame(pop_means, list("GIH",-7,5))
pop_means <- rbind.data.frame(pop_means, list("GIH",2,15))
ggplot(main_data, aes(x=x1, y=x2, colour=pop)) +
geom_point(size=s, alpha=a) +
scale_color_manual(name = "pop", values=pal) +
ggtitle("Thousand Genomes, coloured by sample population") +
xlab("UMAP1") + ylab("UMAP2") + theme_bw() +
theme(legend.position = "none") +
geom_label_repel(data = pop_means,
aes(x=x1, y=x2, label=pop, colour = as.factor(pop)), size=5, alpha=0.6, label.size=0.05)
# In this visualization, cluster 16 (mostly ITU individuals) is split into multiple clusters
cluster_means <- get_cluster_means(main_data, "x1", "x2", "cluster")
cluster_means <- subset(cluster_means, cluster!=16)
cluster_means_extra <- cluster_means[FALSE,]
cluster_means_extra <- rbind.data.frame(cluster_means_extra, data.frame(cluster="16",x1=-4.3,x2=5))
cluster_means_extra <- rbind.data.frame(cluster_means_extra, data.frame(cluster="16",x1=15,x2=14))
# Colour palette
colourCount <- length(unique(main_data$cluster))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
colpal <- colorRampPalette(brewer.pal(8, "Set1"))(colourCount)
# manually change some colours:
colpal[15] <- "#FFBC2D"
# randomly permute the colours for clarity.
# sequential clusters are often (1) next to each other and (2) similarly coloured.
colpal <- sample(colpal)
# Plot the clusters
ggplot(main_data, aes(x=x1, y=x2, colour=cluster)) +
geom_point(size=s, alpha=a) +
scale_color_manual(values = colpal) +
geom_label_repel(data = cluster_means,
aes(x=x1, y=x2, label=cluster, colour = as.factor(cluster)), size=5, alpha=0.6) +
geom_label(data = cluster_means_extra,
aes(x=x1, y=x2, label=cluster, colour = as.factor(cluster)), size=5, alpha=0.6) +
ggtitle("Thousand Genomes, coloured by clustering algorithm") +
xlab("UMAP1") + ylab("UMAP2") + theme_bw() +
theme(legend.position = "none")
In this chunk we generate interactive plots. One plot is coloured by 1KGP population label, while the other is coloured by cluster. Hovering over the points will show the relevant information for each individual.
s <- 2
main_data <- data.frame(cbind(ids, coords))
colnames(main_data)[1] <- "ID"
main_data$ID <- as.character(main_data$ID)
main_data <- main_data[,-c(2)]
# add population labels
main_data <- merge(main_data, samples, by.x = "ID", by.y = "sample")
# add cluster labels
main_data$cluster <- clusters$cluster
# unassigned boolean
# The clustering for this demo has been selected to not have any unclustered individuals
main_data$unassigned <- ifelse(main_data$cluster==-1,"unassigned","assigned")
colourCount <- length(unique(main_data$cluster))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
fig_main <- plot_ly(data = main_data, x = ~x1, y = ~x2,
marker = list(size=s)) %>%
add_markers(color = ~pop, colors = pal,
text = ~paste( "ID", ID, "\nCluster", cluster),
type = "scattergl", mode = "markers")
## This version of 'shiny' is designed to work with 'htmlwidgets' >= 1.5.
## Please upgrade via install.packages('htmlwidgets').
fig_main
fig_clusters <- plot_ly(data = main_data, x = ~x1, y = ~x2,
marker = list(size=s)) %>%
add_markers(color = ~cluster, colors = getPalette(colourCount),
text = ~paste( "ID", ID, "\nPopulation", pop),
type = "scattergl", mode = "markers")
fig_clusters
This chunk of code looks at specific populations to see how many of their individuals wind up in different clusters.
print_pop_table <- function(p, in_data){
# Print the cluster membership of a specified population
cat(paste(p, "below"))
temp_data <- in_data %>% filter(pop==p)
print(table(temp_data$cluster))
cat("Proportion of population in largest cluster:")
print(max(table(temp_data$cluster))/sum(table(temp_data$cluster)))
cat("\n")
}
for (p in unique(main_data$pop)){
print_pop_table(p, main_data)
}
## GBR below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 105 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## FIN below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 104 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## CHS below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 171
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## PUR below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 145 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9731544
##
## CDX below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 18 19 20
## 104 0 0
## Proportion of population in largest cluster:[1] 0.9904762
##
## CLM below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 4 142 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9726027
##
## IBS below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 162 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## PEL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 128 0 0 0 0 1 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9922481
##
## PJL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 95 12 0 0 0 0
## Proportion of population in largest cluster:[1] 0.6129032
##
## KHV below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
## 18 19 20
## 118 0 0
## Proportion of population in largest cluster:[1] 0.9752066
##
## ACB below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 122 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## GWD below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 179 1
## Proportion of population in largest cluster:[1] 0.9944444
##
## ESN below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 172 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## BEB below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 133 0 0 0 0 0 0 0 10 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9300699
##
## MSL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 122
## Proportion of population in largest cluster:[1] 1
##
## STU below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 124 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.96875
##
## ITU below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 109 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9237288
##
## CEU below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 183 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## YRI below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 1 181 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9945055
##
## CHB below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 105
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## JPT below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 0 0 0 0 0 0 0 104 0 0 0 1
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9904762
##
## LWK below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 110 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## ASW below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 0 0 0 103 0 0 0 0 1 3 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 0.9626168
##
## MXL below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 0 0 0 7 0 0 0 0 0 0 0 97 0 0 0 0 0 0 0 0 0
## Proportion of population in largest cluster:[1] 0.9326923
##
## TSI below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 0 0 0 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 18 19 20
## 0 0 0
## Proportion of population in largest cluster:[1] 1
##
## GIH below
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 0 0 0 0 0 69 0 0 0 0 0 0 0 0 0 41 1 0 0 0 0
## Proportion of population in largest cluster:[1] 0.6216216